Omicron BA.2 (B.1.1.529.2): high potential to becoming the next dominating variant

The Omicron variant of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has rapidly replaced the Delta variant as a dominating SARS-CoV-2 variant because of natural selection, which favors the variant with higher infectivity and stronger vaccine breakthrough ability. Omicron has three lineages or subvariants, BA.1 (B.1.1.529.1), BA.2 (B.1.1.529.2), and BA.3 (B.1.1.529.3). Among them, BA.1 is the currently prevailing subvariant. BA.2 shares 32 mutations with BA.1 but has 28 distinct ones. BA.3 shares most of its mutations with BA.1 and BA.2 except for one. BA.2 is found to be able to alarmingly reinfect patients originally infected by Omicron BA.1. An important question is whether BA.2 or BA.3 will become a new dominating “variant of concern”. Currently, no experimental data has been reported about BA.2 and BA.3. We construct a novel algebraic topology-based deep learning model trained with tens of thousands of mutational and deep mutational data to systematically evaluate BA.2’s and BA.3’s infectivity, vaccine breakthrough capability, and antibody resistance. Our comparative analysis of all main variants namely, Alpha, Beta, Gamma, Delta, Lambda, Mu, BA.1, BA.2, and BA.3, unveils that BA.2 is about 1.5 and 4.2 times as contagious as BA.1 and Delta, respectively. It is also 30% and 17-fold more capable than BA.1 and Delta, respectively, to escape current vaccines. Therefore, we project that Omicron BA.2 is on its path to becoming the next dominating variant. We forecast that like Omicron BA.1, BA.2 will also seriously compromise most existing mAbs, except for sotrovimab developed by GlaxoSmithKline.


S1 Supplementary figures
provides the statistic analysis of BFE changes of RBD-ACE2 induced by mutations from Alpha, Beta, Gamma, Lambda, and Mu.

S2 Supplementary data
The Supplementary Data.zip contains four files as listed in the following subsection.

S2.1 Disrupted antibodies
File antibodies BFEs.csv shows the BFE changes of antibodies disrupted by Omicron mutations.

S3 Supplementary feature generation methods
In this section, the workflow of the deep learning-based BFE change predictions of protein-protein interactions induced by mutations for the present SARS-CoV-2 variant analysis and prediction will be firstly introduced, which includes four steps as shown in Figure S2: (1) training data preparation; (2) feature generations of protein-protein interaction complexes; and (3) prediction of protein-protein interactions by deep neural networks. Next, the validation of our machine learning-based model will be demonstrated, suggesting consistent and reliable results compared to the experimental deep mutations data.   Figure S1: The extension of Figure 3. Analysis of variant mutation-induced BFE changes of 185 antibody-RBD complexes. a1, b1, c1, d1 and e1 The distributions (counts) of accumulated BFE changes induced by Alpha, Beta, Gamma, Lambda, and Mu mutations respectively for 185 antibody-RBD complexes. For each case, there are more mutation-weakened complexes than mutation-strengthened complexes. a2, b2, c2, d2 and e2 The numbers of antibody-RBD complexes regarded as disrupted by Alpha, Beta, Gamma, Lambda, and Mu mutations respectively under different thresholds ranging from 0 kcal/mol, -0.3 kcal/mol, to <-3 kcal/mol.

S3.1 Methods for BFE change predictions
In this section, the process of the machine learning-based BFE change predictions is introduced. Once the data pre-processing and SNP genotyping are carried out, we will firstly proceed with the training data preparation process, which plays a key role in reliability and accuracy. A library of 185 antibodies and RBD complexes, as well as an ACE2-RBD complex, are obtained from Protein Data Bank (PDB). RBD mutation-induced BFE changes of these complexes are evaluated by the following machine learning model. According to the emergency and the rapid change of RNA virus, it is rare to have massive experimental BFE change data of SARS-CoV-2, while, on the other hand, next-generation sequencing data is relatively easy to collect. In the training process, the dataset of BFE changes induced by mutations of the SKEMPI 2.0 Figure S2: Illustration of genome sequence data pre-processing and BFE change predictions.
dataset [1] is used as the basic training set, while next-generation sequencing datasets are added as assistant training sets. The SKEMPI 2.0 contains 7,085 single-and multi-point mutations and 4,169 elements of that in 319 different protein complexes used for the machine learning model training. The mutational scanning data consists of experimental data of the binding of ACE2 and RBD induced mutations on ACE2 [2] and RBD [3,4], and the binding of CTC-445.2 and RBD with mutations on both protein [4].
Next, the feature generations of protein-protein interaction complexes are performed. The elementspecific algebraic topological analysis on complex structures is implemented to generate topological bar codes [5][6][7][8]. In addition, biochemistry and biophysics features such as Coulomb interactions, surface areas, electrostatics, et al., are combined with topological features [9]. The detailed information about the topologybased models will be demonstrated in subsection S3.2. Lastly, deep neural networks for SARS-CoV-2 are constructed for the BFE change prediction of protein-protein interactions [5]. The detailed descriptions of dataset and machine learning model are found in the literature [5,10,11] and are available at TopNetmAb.

S3.2.1 Topology features
Among all features generated for machine learning prediction, the application of topology theory makes the model to a whole new level. Those summarized as other inputs are called as auxiliary features and are described in Section S3.3.2 and S3.3.3. In this section, a brief introduction about the theory of topology will be discussed. Algebraic topology [6,7] has achieved tremendous success in many fields including biochemical and biophysical properties [8]. Special treatment should be implemented for biology applications to describe element types and amino acids in poly-peptide mathematically, which element-specific and site-specific persistent homology [10,12]. To construct the algebraic topological features on protein-protein interaction model, a series of element subsets for complex structures should be defined, which considers atoms from the mutation sites, atoms in the neighborhood of the mutation site within a certain distance, atoms from antibody binding site, atoms from antigen binding site, and atoms in the system that belong to type of {C, N, O}, A ele (E). Under the element/site-specific construction, simplicial complexes is constructed on point clouds formed by atoms. For example, a set of independent k + 1 points is from one element/site-specific set U = {u 0 , u 1 , ..., u k }. The k-simplex σ is a convex hull of k +1 independent points U , which is a convex combination of independent points. For example, a 0-simplex is a point and a 1-simplex is an edge. Thus, a m-face of the k-simplex with m+1 vertices forms a convex hull in a lower dimension m < k and is a subset of the k+1 vertices of a k-simplex, so that a sum of all its (k−1)-faces is the boundary of a k-simplex σ as where u 0 , ...,û i , ..., u k consists of all vertices of σ excluding u i . The collection of finitely many simplices is a simplicial complex. In the model, the Vietoris-Rips (VR) complex (if and only if B(u ij , r) ∩ B(u i j , r) = ∅ for j, j ∈ [0, k]) is for dimension 0 topology, and alpha complex (if and only if ∩ ui j ∈σ B(u ij , r) = ∅) is for point cloud of dimensions 1 and 2 topology [8].
The k-chain c k of a simplicial complex K is a formal sum of the k-simplices in K, which is c k = α i σ i , where α i is coefficients and is chosen to be Z 2 . Thus, the boundary operator on a k-chain c k is such that ∂ k : C k → C k−1 and follows from that boundaries are boundaryless as a sequence of complexes by boundary maps. Therefore, the Betti numbers are given as the ranks of kth homology group H k as β k = rank(H k ), where H k = Z k /B k , k-cycle group Z k and the k-boundary group B k . The Betti numbers are the key for topological features, where β 0 gives the number of connected components, such as number of atoms, β 1 is the number of cycles in the complex structure, and β 2 illustrates the number of cavities. This presents abstract properties of the 3D structure.
Finally, only one simplicial complex couldn't give the whole picture of the protein-protein interaction structure. A filtration of a topology space is needed to extract more properties. A filtration is a nested sequence such that Each element of the sequence could generate the Betti numbers {β 0 , β 1 , β 2 } and consequentially, a series of Betti numbers in three dimensions is constructed and applied to be the topological fingerprints in Figure S2.

S3.2.2 Residue-level features
Mutation site neighborhood amino acid composition Neighbor residues are the residues within 10Å of the mutation site. Distances between residues are calculated based on residue C α atoms. Six categories of amino acid residues are counted, which are hydrophobic, polar, positively charged, negatively charged, special cases, and pharmacophore changes. The count and percentage of the 6 amino acid groups in the neighbor site are regrading as the environment composition features of the mutation site. The sum, average, and variance of residue volumes, surface areas, weights, and hydropathy scores are used but only the sum of charges is included.

pKa shifts
The pKa values are calculated by the PROPKA software [13], namely the values of 7 ionizable amino acids, namely, ASP, GLU, ARG, LYS, HIS, CYS, and TYR. The maximum, minimum, sum, the sum of absolute values, and the minimum of the absolute value of total pKa shifts are calculated. We also consider the difference of pKa values between a wild type and its mutant. Additionally, the sum and the sum of the absolute value of pKa shifts based on ionizable amino acid groups are included.
Position-specific scoring matrix (PSSM) Features are computed from the conservation scores in the position-specific scoring matrix of the mutation site for the wild type and the mutant as well as their difference. The conservation scores are generated by PSI-BLAST [14].
Secondary structure The SPIDER2 software is used to compute the probability scores for residue torsion angle and residues being in a coil, alpha helix, and beta strand based on the sequences for the wild type and the mutant [15].

S3.2.3 Atom-level features
Seven groups of atom types, including C, N, O, S, H, all heavy atoms, and all atoms, are considered when generating the element-type features. Meanwhile, other three atom types, i.e., mutation site atoms, all heavy atoms, and all atoms, are used when generating the general atom-level features.
Surface areas Atom-level solvent excluded surface areas are computed by ESES [16].
Partial changes Partial change of each atom is generated by pdb2pqr software [17] using the Amber force field [18] for wild type and CHARMM force field [19] for mutant. The sum of the partial charges and the sum of absolute values of partial charges for each atomic group are collected.
Atomic pairwise interaction interactions Coulomb energy of the ith single atom is calculated as the sum of pairwise coulomb energy with every other atom as where k e is the Coulomb's constant, r ij is the distance of ith atom to jth atom, and q i is the charge of ith atom. The van der Waals energy of the ith atom is modeled as the sum of pairwise Lennard-Jones potentials with other atoms as where is the depth of the potential well, and r i is van der Waals radii.
In atomic pairwise interaction, 5 groups (C, N, O, S, and all heavy atoms) are counted both for Coulomb interaction energy and van der Waals interaction energy.
Electrostatic solvation free energy Electrostatic solvation free energy of each atom is calculated using the Poisson-Boltzmann equation via MIBPB [20] and are summed up by atom groups.

S4 Supplementary machine learning methods
The topology-based network model for BFE change predictions induced mutations on SARS-CoV-2 studying applies a deep neural network structure. Similar approaches have been widely implemented in the energy prediction of protein-ligand binding [21] and protein-protein interactions [12]. The neural network method maps an input feature layer to output layer and mimics biological brains for solving problems where numerous neuron units are involved and weights of neurons are updated by backpropagation methods. To make more complicated structure in order to extract abstract properties, more layers and more neurons in each layer can be constructed. In the training process, optimization methods are applied to avoid overfitting issue, such as dropout methods [22] that a partial of computed neurons of each layer is dropped. For the model cross validations, the Pearson correlation of 10-fold cross-validation is 0.864, and the root mean square error is 1.019 kcal/mol.

S4.1 Deep learning algorithms
A deep neural network is a neural network method with multi-layers (hidden layers) of neurons between the input and output layers. In each layer, the single neuron gets fully connected with the neurons in the next layer. It should preserve the consistency of all labels when applying the model for mutation-induced BFE change predictions. The loss function is constructed as follows: where N is the number of samples, f is a function of the feature vector x i parameterized by a weight vector W and bias term b, and λ represents a penalty constant.

S4.2 Optimization
The backpropagation is applied to evaluate the loss function starting from the output layer and propagates backward through the network structure to update the weight vector W and bias term b. Since gradient calculation is required, therefore, we apply the stochastic gradient descent method with momentum, which only evaluates a small part of training data and can be considered as calculating exponentially weighted averages, which is given as where W i is the parameters in the network, L(W i , b i ) is the objective function, η is the learning rate, X and y are the input and target of the training set, and β ∈ [0, 1] is a scalar coefficient for the momentum term. The momentum term involved accelerates the converging speed.

S5 Supplementary validation
In the main content, we briefly summarized validations of our machine learning predictions and experimental data. For large quantitative validations, we compared the BFE change prediction for mutations on S protein RBD to the experimental deep mutational enrichment data on RBD binding to human ACE2 and CTC-445.2 induced by RBD mutations [4,5,9]. To make these validations, we eliminated the experimental deep mutational enrichment data of RBD binding to human ACE2 and CTC-445.2 from the training sets and set them as testing sets, which have 1539 and 1500 samples, respectively. In the validation of RBD and CTC-445.2 complex, there is a very high correlation between the enrichment data and machine learning predictions, as well as the validation of RBD binding to ACE2, with Pearson correlations are 0.69 and 0.70, respectively. The deep mutational enrichment data can give a proportional descriptor of the affinity strength of protein-protein interactions induced by mutations. The machine learning methods, however, give stable and equalized evaluations, while experimental data might be different dramatically due to conditions and environments.
In addition, we compared our machine learning results with other experimental data, which are escape fraction, pseudovirus infection changes, and IC 50 fold changes [5]. In the comparison of 35 cases to experimental escape fractions on RBD binding to clinical trial antibodies induced by emerging mutations, our machine learning predictions have a Pearson correlation of 0.80. Especially, those high escaping mutations E484K and E484Q on LY-CoV555, and mutations K417T and K417N on LY-CoV016, are indicated by both our predictions and the experimental data [5]. We also use the pattern comparisons of our prediction to experimental data. Lastly, we collected experimental data from different literature [23][24][25][26]. According to variations from different research groups, they were summarized in increasing/decreasing patterns of emerging variant (including co-mutations) impacts on antibody therapies in clinical trials. In total, there are 20 pattern comparisons with an excellent agreement between various experimental data and our predictions, except for a minor discrepancy [5].